AITopics | language documentation

Collaborating Authors

language documentation

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Integrating Linguistics and AI: Morphological Analysis and Corpus development of Endangered Toto Language of West Bengal

Guha, Ambalika, Saha, Sajal, Ballav, Debanjan, Mitra, Soumi, Chakraborty, Hritwick

arXiv.org Artificial IntelligenceOct-28-2025

Preserving linguistic diversity is necessary as every language offers a distinct perspective on the world. There have been numerous global initiatives to preserve endangered languages through documentation. This paper is a part of a project which aims to develop a trilingual (Toto-Bangla-English) language learning application to digitally archive and promote the endangered Toto language of West Bengal, India. This application, designed for both native Toto speakers and non-native learners, aims to revitalize the language by ensuring accessibility and usability through Unicode script integration and a structured language corpus. The research includes detailed linguistic documentation collected via fieldwork, followed by the creation of a morpheme-tagged, trilingual corpus used to train a Small Language Model (SLM) and a Transformer-based translation engine. The analysis covers inflectional morphology such as person-number-gender agreement, tense-aspect-mood distinctions, and case marking, alongside derivational strategies that reflect word-class changes. Script standardization and digital literacy tools were also developed to enhance script usage. The study offers a sustainable model for preserving endangered languages by incorporating traditional linguistic methodology with AI. This bridge between linguistic research with technological innovation highlights the value of interdisciplinary collaboration for community-based language revitalization.

artificial intelligence, machine translation, natural language, (18 more...)

arXiv.org Artificial Intelligence

2510.22629

Country: Asia > India > West Bengal > Kolkata (0.14)

Genre: Research Report > New Finding (0.46)

Industry:

Media > Film (1.00)
Leisure & Entertainment (1.00)
Education (1.00)

Technology: Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.48)

Add feedback

Automated Quality Control for Language Documentation: Detecting Phonotactic Inconsistencies in a Kokborok Wordlist

van Dam, Kellen Parker, Stephen, Abishek

arXiv.org Artificial IntelligenceOct-27-2025

Lexical data collection in language documentation often contains transcription errors and undocumented borrowings that can mislead linguistic analysis. We present unsupervised anomaly detection methods to identify phono-tactic inconsistencies in wordlists, applying them to a multilingual dataset of Kokborok varieties with Bangla. Using character-level and syllable-level phonotactic features, our algorithms identify potential transcription errors and borrowings. While precision and recall remain modest due to the subtle nature of these anomalies, syllable-aware features significantly outperform character-level baselines. The high-recall approach provides fieldworkers with a systematic method to flag entries requiring verification, supporting data quality improvement in low-resourced language documentation.

data mining, iforest 0, machine learning, (13 more...)

arXiv.org Artificial Intelligence

2510.21584

Country:

Europe > Germany (0.14)
Asia > India (0.14)

Genre: Research Report (0.64)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (0.63)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.48)

Add feedback

Interdisciplinary Research in Conversation: A Case Study in Computational Morphology for Language Documentation

Rice, Enora, von der Wense, Katharina, Palmer, Alexis

arXiv.org Artificial IntelligenceSep-25-2025

Computational morphology has the potential to support language documentation through tasks like morphological segmentation and the generation of Interlinear Glossed Text (IGT). However, our research outputs have seen limited use in real-world language documentation settings. This position paper situates the disconnect between computational morphology and language documentation within a broader misalignment between research and practice in NLP and argues that the field risks becoming decontextualized and ineffectual without systematic integration of User-Centered Design (UCD). To demonstrate how principles from UCD can reshape the research agenda, we present a case study of GlossLM, a state-of-the-art multilingual IGT generation model. Through a small-scale user study with three documentary linguists, we find that despite strong metric based performance, the system fails to meet core usability needs in real documentation contexts. These insights raise new research questions around model constraints, label standardization, segmentation, and personalization. We argue that centering users not only produces more effective tools, but surfaces richer, more relevant research directions

artificial intelligence, computational linguistic, natural language, (17 more...)

arXiv.org Artificial Intelligence

2509.10644

Country:

Europe (1.00)
Asia > Middle East > UAE (0.46)
North America > United States > Colorado > Boulder County > Boulder (0.28)

Genre:

Questionnaire & Opinion Survey (1.00)
Research Report > New Finding (0.88)
Research Report > Experimental Study (0.66)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Human Computer Interaction (0.89)

Add feedback

Supporting SENCOTEN Language Documentation Efforts with Automatic Speech Recognition

Geng, Mengzhe, Littell, Patrick, Pine, Aidan, PENÁĆ, null, Tessier, Marc, Kuhn, Roland

arXiv.org Artificial IntelligenceJul-22-2025

The SENCOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENCOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best restoring to maximize the use of available data. Experiments on the SENCOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENCOTEN language documentation.

artificial intelligence, cross-lingual transfer, machine learning, (17 more...)

arXiv.org Artificial Intelligence

2507.10827

Country: North America > Canada > British Columbia (0.28)

Genre: Research Report (0.82)

Industry: Education (0.88)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

Wav2Gloss: Generating Interlinear Glossed Text from Speech

He, Taiqi, Choi, Kwanghee, Tjuatja, Lindia, Robinson, Nathaniel R., Shi, Jiatong, Watanabe, Shinji, Neubig, Graham, Mortensen, David R., Levin, Lori

arXiv.org Artificial IntelligenceJun-5-2024

Thousands of the world's languages are in danger of extinction--a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages' communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches.

dataset, transcription, translation, (15 more...)

arXiv.org Artificial Intelligence

2403.13169

Country:

North America > Canada > Ontario > Toronto (0.04)
Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.04)
Europe > United Kingdom > England > Oxfordshire > Oxford (0.04)
(12 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech (0.69)
Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)

Add feedback

Artificial Intelligence and the Spatial Documentation of Languages

Ghanim, Hakam

arXiv.org Artificial IntelligenceApr-1-2024

The advancement in technology has made interdisciplinary research more accessible. Particularly, the breakthrough in Artificial Intelligence (AI) has given huge advantages to researchers working in interdisciplinary and multidisciplinary fields. This study investigates the ability of AI models, particularly GPT-4 and GPT Data Analyst, in creating language maps for language documentation. The study Integrates documentary linguistics, linguistic geography, and AI by showcasing how AI models facilitate the spatial documentation of languages through the creation of language maps with minimal cartographic expertise. The study is conducted using a CSV file and a GeoJSON file both obtained from HDX and from the researcher's fieldwork. The study data is then applied in realtime conversations with the AI models in order to generate the language distribution maps. The study highlights the two AI models capabilities in generating high-quality static and interactive web maps and streamlining the mapmaking process, despite facing challenges like inconsistencies and difficulties in adding legends. The findings suggest a promising future for AI in generating language maps and enhancing the work of documentary linguists as they collect their data in the field, pointing towards the need for further development to fully harness AI's potential in this field. Key words: language documentation, linguistic geography, geo-linguistics, cartography, artificial intelligence, ChatGPT 1-Introduction The evolution of technology has profoundly shaped the field of language documentation, marking a journey from the humble pen and notebook to the sophisticated realms of digital mapping and artificial intelligence.

ai model, documentation, gpt data analyst, (15 more...)

arXiv.org Artificial Intelligence

2404.01263

Country:

Asia > Middle East > Iraq (0.15)
North America > Canada > Ontario > National Capital Region > Ottawa (0.04)
Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
(6 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.48)

Industry: Education (0.68)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

From `Snippet-lects' to Doculects and Dialects: Leveraging Neural Representations of Speech for Placing Audio Signals in a Language Landscape

Guillaume, Séverine, Wisniewski, Guillaume, Michaud, Alexis

arXiv.org Artificial IntelligenceMay-29-2023

XLSR-53 a multilingual model of speech, builds a vector representation from audio, which allows for a range of computational treatments. The experiments reported here use this neural representation to estimate the degree of closeness between audio files, ultimately aiming to extract relevant linguistic properties. We use max-pooling to aggregate the neural representations from a "snippet-lect" (the speech in a 5-second audio snippet) to a "doculect" (the speech in a given resource), then to dialects and languages. We use data from corpora of 11 dialects belonging to 5 less-studied languages. Similarity measurements between the 11 corpora bring out greatest closeness between those that are known to be dialects of the same language. The findings suggest that (i) dialect/language can emerge among the various parameters characterizing audio files and (ii) estimates of overall phonetic/phonological closeness can be obtained for a little-resourced or fully unknown language. The findings help shed light on the type of information captured by neural representations of speech and how it can be extracted from these representations

artificial intelligence, machine learning, natural language, (18 more...)

arXiv.org Artificial Intelligence

2305.18602

Country:

Europe > Italy > Molise (0.05)
Europe > Czechia > South Moravian Region > Brno (0.04)
Europe > Netherlands > South Holland > The Hague (0.04)
(4 more...)

Genre: Research Report > New Finding (1.00)

Technology:

Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Speech > Speech Recognition (0.46)

Add feedback

SIGMORPHON 2023 Shared Task of Interlinear Glossing: Baseline Model

Ginn, Michael

arXiv.org Artificial IntelligenceMar-24-2023

Language documentation is a critical aspect of language preservation, often including the creation of Interlinear Glossed Text (IGT). Creating IGT is time-consuming and tedious, and automating the process can save valuable annotator effort. This paper describes the baseline system for the SIGMORPHON 2023 Shared Task of Interlinear Glossing. In our system, we utilize a transformer architecture and treat gloss generation as a sequence labelling task.

large language model, machine learning, natural language, (20 more...)

arXiv.org Artificial Intelligence

2303.14234

Country:

North America > United States > Colorado (0.05)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)

Genre: Research Report (0.40)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.34)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

User-friendly automatic transcription of low-resource languages: Plugging ESPnet into Elpis

Adams, Oliver, Galliot, Benjamin, Wisniewski, Guillaume, Lambourne, Nicholas, Foley, Ben, Sanders-Dwyer, Rahasya, Wiles, Janet, Michaud, Alexis, Guillaume, Séverine, Besacier, Laurent, Cox, Christopher, Aplonova, Katya, Jacques, Guillaume, Hill, Nathan

arXiv.org Artificial IntelligenceFeb-22-2021

This paper reports on progress integrating the speech recognition toolkit ESPnet into Elpis, a web front-end originally designed to provide access to the Kaldi automatic speech recognition toolkit. The goal of this work is to make end-to-end speech recognition models available to language workers via a user-friendly graphical interface. Encouraging results are reported on (i) development of an ESPnet recipe for use in Elpis, with preliminary results on data sets previously used for training acoustic models with the Persephone toolkit along with a new data set that had not previously been used in speech recognition, and (ii) incorporating ESPnet into Elpis along with UI enhancements and a CUDA-supported Dockerfile.

elpis, language documentation, speech recognition, (16 more...)

arXiv.org Artificial Intelligence

2101.03027

Country:

North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
North America > Canada > Alberta (0.14)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
(14 more...)

Genre: Research Report (1.00)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

Unsupervised Word Segmentation from Speech with Attention

Godard, Pierre, Zanon-Boito, Marcely, Ondel, Lucas, Berard, Alexandre, Yvon, François, Villavicencio, Aline, Besacier, Laurent

arXiv.org Artificial IntelligenceJun-18-2018

We present a first attempt to perform attentional word segmentation directly from the speech signal, with the final goal to automatically identify lexical units in a low-resource, unwritten language (UL). Our methodology assumes a pairing between recordings in the UL with translations in a well-resourced language. It uses Acoustic Unit Discovery (AUD) to convert speech into a sequence of pseudo-phones that is segmented using neural soft-alignments produced by a neural machine translation model. Evaluation uses an actual Bantu UL, Mboshi; comparisons to monolingual and bilingual baselines illustrate the potential of attentional word segmentation for language documentation.

machine learning, natural language, segmentation, (19 more...)

arXiv.org Artificial Intelligence

1806.06734

Country:

South America > Colombia > Meta Department > Villavicencio (0.04)
North America > United States > California > San Diego County > San Diego (0.04)
Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
(11 more...)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Machine Translation (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.46)
Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.46)

Add feedback